library("knitr")
library("reprex")
library("tidyverse")
Go to this link to download R: https://cran.rstudio.com/
Select the version that works for your operating system, and download the latest release (R-3.6.0).
Figure 1.1: Download R.
Once you’ve downloaded R, install it following the instructions on the screen.
Go to this link to download RStudio: https://www.rstudio.com/products/rstudio/download/#download
Then download the version that works for your operating system.
Figure 1.2: Download RStudio.
Once you’ve downloaded RStudio, install it following the instructions on the screen.
Figure 3.1: General preferences.
Make sure that:
Figure 3.2: Code window preferences.
Make sure that:
This way you don’t have to scroll horizontally. At the same time, avoid writing long single lines of code. For example, instead of writing code like so:
ggplot(data = diamonds, aes(x = cut, y = price)) +
  stat_summary(fun.y = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
  stat_summary(fun.data = "mean_cl_boot", geom = "linerange", size = 1.5) +
  labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")
You may want to write it this way instead:
ggplot(data = diamonds, aes(x = cut, y = price)) +
  # display the means
  stat_summary(fun.y = "mean",
               geom = "bar",
               color = "black",
               fill = "lightblue",
               width = 0.85) +
  # display the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               size = 1.5) +
  # change labels
  labs(title = "Price as a function of quality of cut",
       subtitle = "Note: The price is in US dollars", # we might want to change this later
       tag = "A",
       x = "Quality of the cut",
       y = "Price")
This makes it much easier to see what’s going on, and you can easily add comments to individual lines of code.
RStudio makes it easy to write nice code. It figures out where to put the next line of code when you press ENTER. And if things ever get messy, just select the code of interest and hit cmd + i to re-indent the code.
Here are some more tips on how to write nice code in R:
There are three simple ways to get help in R. You can put a ? in front of the function you’d like to learn more about, use the help() function, or press F1 with the cursor on the function name.
?print
help("print")
Tip: To see the help file, hover over a function (or dataset) with the mouse (or select the text) and then press F1.
I recommend using F1 to get to help files – it’s the fastest way!
R help files can sometimes look a little cryptic. Most R help files have the following sections (copied from here):
Title: A one-sentence overview of the function.
Description: An introduction to the high-level objectives of the function, typically about one paragraph long.
Usage: A description of the syntax of the function (in other words, how the function is called). This is where you find all the arguments that you can supply to the function, as well as any default values of these arguments.
Arguments: A description of each argument. Usually this includes a specification of the class (for example, character, numeric, list, and so on). This section is an important one to understand, because arguments are frequently a cause of errors in R.
Details: Extended details about how the function works, including longer descriptions of the various ways to call the function (if applicable) and a longer discussion of the arguments.
Value: A description of the class of the value returned by the function.
See also: Links to other relevant functions. In most of the R editors, you can click these links to read the Help files for these functions.
Examples: Worked examples of real R code that you can paste into your console and run.
Here is the help file for the print() function:
Figure 3.3: Help file for the print() function.
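The information in the Usage and Arguments sections can also be checked from the console. Here is a small sketch using base R's args() and formals(); sd() is picked only as an illustration and is not a function discussed above.

```r
# Show the call signature, as listed under "Usage" in the help file
args(sd)

# Extract the argument names and default values programmatically
sd_arguments <- formals(sd)
arg_names <- names(sd_arguments)
default_na_rm <- sd_arguments$na.rm

print(arg_names)
print(default_na_rm)
```

This is handy when you only need a quick reminder of an argument name and don't want to open the full help file.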
What makes R powerful is the large number of packages that have been written for R. You can install a new package like so:
install.packages("tidyverse")
You can also install multiple packages at the same time, by concatenating the package names using the c() function:
install.packages(c("tidyverse", "broom"))
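If you run the same script repeatedly, you may not want to reinstall packages that are already there. One possible sketch, assuming you only want to install what is missing (the variable names are made up for this example):

```r
# Packages we would like to have available
packages <- c("tidyverse", "broom")

# installed.packages() returns a matrix whose row names are the installed packages
missing_packages <- setdiff(packages, rownames(installed.packages()))

# Only attempt the installation in an interactive session,
# and only if something is actually missing
if (interactive() && length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```

The interactive() check is a design choice for this sketch: it keeps the code from downloading packages when the script is run non-interactively.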
To make sure that your packages remain up to date, you can go to Tools > Check for Package Updates... in RStudio.
Figure 3.4: Checking for package updates.
You can then click Select All and then Install Updates.
Figure 3.5: Installing package updates.
RStudio might ask you to restart your R session before updating the packages.
Tools > Keyboard Shortcuts Help
base R vs. tidyverse
two different coding styles
the pipe
The order in which packages in R are loaded matters!
You can refer to functions from specific packages by adding the package name, followed by ::, in front of the function name.
For example, MASS::select() uses the select() function from the MASS package, while dplyr::select() uses the function from the dplyr package.
Always load library("tidyverse") last. Functions from packages loaded later mask functions with the same name from packages loaded earlier, and the tidyverse provides a large number of functions that are frequently used.
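To see masking in action without installing anything, here is a minimal base R sketch. It does not use MASS or dplyr; instead, a made-up sd() defined in the workspace plays the role of a package that was loaded later.

```r
# A definition found earlier on the search path masks one found later.
# This hypothetical sd() stands in for a function from a later-loaded package.
sd <- function(x) "my own sd()"

# The plain name now resolves to the masking definition
masked <- sd(c(1, 2, 3))

# package::function always picks the function from the named package
explicit <- stats::sd(c(1, 2, 3))

print(masked)
print(explicit)

# Clean up so that sd() refers to stats::sd() again
rm(sd)
```

Writing package::function is a bit more typing, but it makes the code independent of the order in which packages were loaded.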
df.data = read_csv(file = "../../data/top2018songs.csv") %>%
  mutate(rank = 1:nrow(.))
| column | description |
|---|---|
| id | Spotify URI of the song |
| name | Name of the song |
| artists | Artist(s) of the song |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | The duration of the track in milliseconds. |
| time_signature | An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). |
The quickest way to take a look at your data is to hover your mouse over a variable of a data frame, and press F2.
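You can also inspect a data frame from the console. The sketch below uses the built-in mtcars data set as a stand-in, since the song data file may not be available on your machine; df.example is a name made up for this illustration.

```r
# mtcars ships with R; it stands in for df.data here
df.example <- mtcars

# Number of rows and columns
n_rows <- nrow(df.example)
n_cols <- ncol(df.example)

# First few rows
head(df.example, n = 5)

# Column types and a preview of the values
str(df.example)

# Summary statistics for a single column
summary(df.example$mpg)
```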
We should always take a look at the data first.
include_graphics("../../figures/plots/bad_plot1.png")
Figure 4.1: A not so good plot.
include_graphics("../../figures/plots/bad_plot2.jpg")
Figure 4.2: Another could-be-improved plot.
This second plot reminded me of the following:
include_graphics("../../figures/plots/correlation_aint_causation.png")
Figure 4.3: Correlation is not causation.
Just because two lines look similar, doesn’t mean that anything interesting is going on – it certainly doesn’t mean that the two phenomena represented by the lines are causally connected. For more inspiration check out this site https://www.tylervigen.com/spurious-correlations.
Figure 4.4: The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson’s correlation).
The data sets in Figure 4.4 all share the same summary statistics. Clearly, the data sets are not the same though.
Tip: Always plot the data first!
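You can try this out yourself with a smaller example. The sketch below does not use the Datasaurus Dozen; it uses Anscombe's quartet, which ships with R as the anscombe data frame and makes the same point with four data sets.

```r
# anscombe has columns x1..x4 and y1..y4, one (x, y) pair per data set
x_means <- sapply(anscombe[, paste0("x", 1:4)], mean)
y_means <- sapply(anscombe[, paste0("y", 1:4)], mean)
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# The summary statistics are nearly identical across the four data sets
print(round(x_means, 2))
print(round(y_means, 2))
print(round(correlations, 2))

# Plotting shows that the data sets look very different
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
```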
Here is the paper from which I took Figure 4.4. It explains how the figures were generated and shows more examples of how summary statistics and some kinds of plots are insufficient to get a good sense of what’s going on in the data.
include_graphics("../../figures/plots/box_violin.gif")
Figure 4.5: Boxplots can be misleading.
ggplot2
ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) +
  geom_point()
ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) +
  geom_point() +
  geom_smooth(method = "lm")
ggplot(data = df.data,
       mapping = aes(x = mode,
                     y = valence)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun.y = "mean",
               geom = "point",
               size = 3)
Surprising! Songs in a minor key (mode = 0) sound more positive than songs in a major key (mode = 1).
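To get a feeling for what the error bars from "mean_cl_boot" represent, here is a rough base R sketch of a bootstrapped 95% confidence interval for a mean. This is my understanding of the general idea, not the exact code that ggplot2 runs, and the values are made up for illustration.

```r
set.seed(1)

# Hypothetical valence values (not taken from the song data)
values <- c(0.2, 0.35, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8)

# Resample the data with replacement many times and store each mean
n_boot <- 1000
boot_means <- replicate(n_boot, {
  mean(sample(values, size = length(values), replace = TRUE))
})

# The 2.5% and 97.5% quantiles of the bootstrapped means give the interval
ci <- quantile(boot_means, probs = c(0.025, 0.975))
sample_mean <- mean(values)

print(sample_mean)
print(ci)
```

If the intervals of two groups overlap a lot, as a rule of thumb we should be careful about reading much into the difference between their means.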
Here is a more involved plot that shows some of the things you can do with ggplot2:
df.plot = df.data %>%
  mutate(mode = factor(mode,
                       levels = c(0, 1),
                       labels = c("minor", "major")),
         key = factor(key,
                      levels = 0:11,
                      labels = c("C", "C#", "D", "D#",
                                 "E", "F", "F#", "G",
                                 "G#", "A", "A#", "B")))
ggplot(data = df.plot,
       mapping = aes(x = key,
                     y = energy,
                     group = mode,
                     fill = mode)) +
  # add individual data points
  geom_point(mapping = aes(color = mode),
             position = position_jitterdodge(dodge.width = 0.7,
                                             jitter.width = 0.1,
                                             jitter.height = 0),
             alpha = 0.3) +
  # add the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               position = position_dodge(width = 0.7),
               size = 0.75) +
  # add the mean data points
  stat_summary(fun.y = "mean",
               geom = "point",
               shape = 21,
               size = 3,
               position = position_dodge(width = 0.7)) +
  # add the vertical lines
  geom_vline(data = tibble(key = 1:10),
             xintercept = seq(from = 1.5, to = 11.5, by = 1),
             linetype = 2,
             color = "gray80") +
  # set title and subtitle of plot
  labs(title = "Energy for songs with different key and mode",
       subtitle = "Energy represents a perceptual measure of intensity and activity.") +
  # change the y-axis
  scale_y_continuous(breaks = seq(0.25, 1, 0.25),
                     labels = seq(0.25, 1, 0.25),
                     limits = c(0.25, 1)) +
  # set the fill color
  scale_fill_brewer(palette = "Set1") +
  # change the plotting theme
  theme_classic() +
  # adjust the text size
  theme(text = element_text(size = 16),
        plot.subtitle = element_text(size = 12))
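The first step in the code above recodes the numeric mode and key columns as factors with readable labels. Here is a small standalone sketch of that recoding step in base R, using a made-up vector instead of the song data:

```r
# Hypothetical raw codes: 0 = minor, 1 = major
mode_raw <- c(0, 1, 1, 0)

# levels lists the values found in the data, labels the names they should get
mode_factor <- factor(mode_raw,
                      levels = c(0, 1),
                      labels = c("minor", "major"))

mode_labels <- as.character(mode_factor)

print(mode_labels)
print(table(mode_factor))
```

Because the labels are attached to the factor, ggplot2 then shows "minor" and "major" in the legend instead of 0 and 1.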
lm()
lmer()
brm() (using library("brms"))
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2
[5] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0
[9] tidyverse_1.2.1 reprex_0.3.0 knitr_1.23
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lubridate_1.7.4 lattice_0.20-38
[4] assertthat_0.2.1 digest_0.6.19 R6_2.4.0
[7] cellranger_1.1.0 backports_1.1.4 acepack_1.4.1
[10] evaluate_0.14 httr_1.4.0 highr_0.8
[13] pillar_1.4.1 rlang_0.3.4 lazyeval_0.2.2
[16] readxl_1.3.1 data.table_1.12.2 rstudioapi_0.10
[19] rpart_4.1-15 Matrix_1.2-17 checkmate_1.9.3
[22] rmarkdown_1.13 labeling_0.3 splines_3.6.0
[25] foreign_0.8-71 htmlwidgets_1.3 munsell_0.5.0
[28] broom_0.5.2 compiler_3.6.0 modelr_0.1.4
[31] xfun_0.7 pkgconfig_2.0.2 base64enc_0.1-3
[34] htmltools_0.3.6 nnet_7.3-12 tidyselect_0.2.5
[37] gridExtra_2.3 htmlTable_1.13.1 bookdown_0.11
[40] Hmisc_4.2-0 crayon_1.3.4 withr_2.1.2
[43] grid_3.6.0 nlme_3.1-140 jsonlite_1.6
[46] gtable_0.3.0 magrittr_1.5 scales_1.0.0
[49] cli_1.1.0 stringi_1.4.3 fs_1.3.1
[52] latticeExtra_0.6-28 xml2_1.2.0 generics_0.0.2
[55] Formula_1.2-3 RColorBrewer_1.1-2 tools_3.6.0
[58] glue_1.3.1 hms_0.4.2 survival_2.44-1.1
[61] yaml_2.2.0 colorspace_1.4-1 cluster_2.0.9
[64] rvest_0.3.4 haven_2.1.0